Words: [The number of words in your document, calculated using the word_count() function at the end of this document] Figures: [The number of figures in your document. You can just count these]
The purpose of this .Rmd file is to search and take note/keep a log of any dodgy data points.
If you are unfamiliar with some of these packages, I’ve left some short code comments for you to find more information.
suppressPackageStartupMessages({
### Tidyverse Collection -----------------------------------------------------
# The tidyverse is an opinionated collection of R packages designed for
# data science. All packages share an underlying design philosophy, grammar,
# and data structures.
### https://www.tidyverse.org/
library(tidyverse)
### Data Exploration ---------------------------------------------------------
# To easily display summary statistics
### https://github.com/ropensci/skimr
library(skimr)
### Data Cleaning/Wrangling --------------------------------------------------
# To easily examine and clean dirty data
### https://www.rdocumentation.org/packages/janitor/versions/2.2.0
library(janitor)
# To easily use date-time data
### https://lubridate.tidyverse.org/
library(lubridate)
# To easily handle categorical variables using factors.
### https://forcats.tidyverse.org/
library(forcats)
### Data Visualisation -------------------------------------------------------
# To easily use already-made themes for data visualisation
### https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
library(ggthemes)
# To combine a scatter plot and a 2D density plot
### https://github.com/LKremer/ggpointdensity
library(ggpointdensity)
### Spatial Visualisation ----------------------------------------------------
# To easily manipulate spatial data
### https://r-spatial.github.io/sf/
library(sf)
# This may take a little while to install if you don't have absmapsdata already.
# Whilst I'm only using one shapefile from the package, I thought acquiring
# the data through a package would be more standardised/reproducible than
# downloading it straight from the ABS.
options(timeout = 1000)
devtools::install_github("wfmackey/absmapsdata")
# To easily access the Australian Bureau of Statistics (ABS) spatial structures
### https://github.com/wfmackey/absmapsdata
library(absmapsdata)
# To easily interact spatial data and ggplot2
### https://paleolimbot.github.io/ggspatial/
library(ggspatial)
### Misc ---------------------------------------------------------------------
# To easily standardise naming conventions based upon a consistent design
# philosophy
### https://github.com/Tazinho/snakecase
library(snakecase)
# To locate and download species observations from the Atlas of Living
# Australia
### https://galah.ala.org.au/
library(galah)
# To easily enable file referencing in project-oriented workflows
### https://here.r-lib.org/
library(here)
})## WARNING: Rtools is required to build R packages, but is not currently installed.
##
## Please download and install Rtools 4.2 from https://cran.r-project.org/bin/windows/Rtools/.
## Skipping install of 'absmapsdata' from a github remote, the SHA1 (513415b9) has not changed since last install.
## Use `force = TRUE` to force installation
Packages are all loaded in! Let’s read in a specific invasive animal species for a particular state and do some sanity checks. Ideally, we will functionalise this code to do this sanity checks on all the invasive animal species that we end up choosing for each state. Arguably, this data cleaning could be a package/RShiny dashboard in of itself!
The following code is mainly reused and inspired by the ALA Labs - An exploration of dingo observations in the ALA.
First, let’s log into the Atlas of Australia using the
galah_config() function.
# Ref [1]
# Use an Atlas of Australia-registered email (Register at ala.org.au)
galah_config(email = "johann.wagner@gmail.com")
# References:
# - [1] https://labs.ala.org.au/posts/2023-05-16_dingoes/post.htmlLet’s focus on exploring rabbits in the ACT. We will scale to other species and other states/territories later.
# Ref [1, 4]
# To download data from the ALA
rabbits <- galah_call() %>%
# To conduct a search for the scientific name of the European Rabbit
galah_identify("oryctolagus cuniculus") %>% # Ref [2]
# Filter records observed in the ACT
galah_filter(cl22 == "Australian Capital Territory") %>%
# Pre-applied filter to ensure quality-assured data
# the "ALA" profile is designed to exclude lower quality records Ref [3]
galah_apply_profile(ALA) %>%
atlas_occurrences()## This query will return 1,855 records
##
## Checking queue
## Current queue size: 1 inqueue
# References:
# - [1] https://labs.ala.org.au/posts/2023-05-16_dingoes/post.html
# - [2] https://bie.ala.org.au/species/https://biodiversity.org.au/afd/taxa/692effa3-b719-495f-a86f-ce89e2981652
# - [3] https://galah.ala.org.au/R/reference/galah_apply_profile.html
# - [4] https://galah.ala.org.au/R/reference/galah.htmlLet’s use the skim() function to explore this
dataset.
| Name | rabbits |
| Number of rows | 1855 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 2 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| scientificName | 0 | 1 | 21 | 31 | 0 | 2 | 0 |
| taxonConceptID | 0 | 1 | 73 | 73 | 0 | 2 | 0 |
| recordID | 0 | 1 | 36 | 36 | 0 | 1855 | 0 |
| dataResourceName | 0 | 1 | 9 | 58 | 0 | 9 | 0 |
| occurrenceStatus | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| decimalLatitude | 0 | 1 | -35.33 | 0.14 | -35.89 | -35.35 | -35.28 | -35.26 | -35.12 | ▁▁▁▇▇ |
| decimalLongitude | 0 | 1 | 149.12 | 0.13 | 148.79 | 149.11 | 149.14 | 149.16 | 150.77 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| eventDate | 0 | 1 | 1964-09-02 | 2023-09-28 04:34:38 | 2017-01-01 | 1088 |
Interestingly, it seems that there are two unique
scientificName, this is unexpected. Similarly, there are
also two unique values for taxonConceptID. There are 1861
occurrences of rabbits in the ACT from 9 different data providers
(dataResourceName). The most recent observation is
2023-09-28 04:34:38 and the earliest is 1964-09-02. All the observations
have longitude and latitude values, we will plot these visually in a
moment to ensure they are all in the ACT.
It will be good to create a function that checks that each observation is within the spatial boundaries of each respective state/territory.
Let’s explore what the two unique scientificName values
are.
Oh wow! This is our first data cleaning pick up! Even with the
pre-applied ALA filter we have found an error in one of the values in
scientificName. Let’s find that specific data point.
Looking at the taxonConceptID values/websites. The
“Oryctolagus cuniculus cuniculus” is actually a subspecies, not a typo!
Now, that we know it is a subspecies and not a typo, let’s note this
value down in our suspicious_data dataset and reassess
later, but keep it in our rabbits dataset.
# Let's reuse the same columns/column names as in rabbits
rabbits_suspicious <- rabbits %>%
filter(recordID == FALSE) %>%
mutate(
suspicious_notes = character(),
date = character()
) %>%
# Add the suspicious data
add_row({
rabbits %>%
filter(scientificName == "Oryctolagus cuniculus cuniculus") %>%
mutate(
suspicious_notes = "This is a subspecies of the Oryctolagus cuniculus species and is the only one in the ACT.",
date = "2023-10-05"
)
})
rabbits_suspiciousThis suspicious data point is from the iNaturalist Australia data provider, which is a Citizen Science project, so potentially more cautious investigation should go into these observations. Let’s continue onto the spatial visualisation.
Let’s visualise the rabbits data spatially by creating a map with data points of the observations.
# Ref [1]: Create spatial visualisation of the ACT
state2021 %>%
filter(state_name_2021 == "Australian Capital Territory") %>%
ggplot() +
# Ref [1]: Create background polygon of the ACT
geom_sf(
aes(geometry = geometry)
) +
geom_point(
data = rabbits,
aes(
x = decimalLongitude,
y = decimalLatitude
),
alpha = 0.6
) +
coord_sf() +
theme_minimal()Interestingly, it seems like there are a select number of observations outside of the ACT borders, which I assume is Jervis Bay territory. Let’s spatially isolate these data points and see if they really are in the Jervis Bay territory.
# Ref [1]: Create spatial visualisation of the ACT
state2021 %>%
filter(state_name_2021 == "Australian Capital Territory") %>%
ggplot() +
# Ref [1]: Create background polygon of the ACT
geom_sf(
aes(geometry = geometry)
) +
# Plot the Jervis Bay polygon
geom_sf(
data = {
suburb2021 %>%
filter(suburb_name_2021 == "Jervis Bay")
},
aes(geometry = geometry)
) +
geom_point(
data = rabbits,
aes(
x = decimalLongitude,
y = decimalLatitude
),
alpha = 0.6
) +
coord_sf() +
theme_minimal()Hypothesis has been confirmed! The data points are from the Jervis Bay territory.
For the sake/scope of this assignment, we will put these data points
into the excluded_data dataset and remove these from our
main dataset. Because I really only want to showcase observations in the
main state/territory borders, as this RShiny Dashboard tool should be
used for macro-scale, not small territories.
Let’s create an sf data type so that we can use the
st_filter() function, which filters the spatial data points
within a given spatial polygon. In our case, we only want to include the
data points that are within the ACT border boundary polygon.
rabbits_sf_clean <- rabbits %>%
# Ref [1]: Convert rabbits tibble into an sf data type
st_as_sf(
coords = c("decimalLongitude", "decimalLatitude"),
crs = "+proj=longlat +datum=WGS84"
) %>%
# Ref [2]: Filter the data points that are within the ACT
st_filter({
state2021 %>%
filter(state_name_2021 == "Australian Capital Territory")}
)
rabbits_sf_clean# References:
# - [1] https://stackoverflow.com/a/52951856/22410914
# - [2] https://rdrr.io/cran/sf/man/st_as_sf.htmlWe’ve now filtered out the excluded data points; however, we want to
keep using the tibble data type, so let’s do some
joins.
rabbits_clean <- rabbits %>%
# Find all the clean observations
right_join(
rabbits_sf_clean,
join_by(
eventDate,
scientificName,
taxonConceptID,
recordID,
dataResourceName,
occurrenceStatus
)
)
rabbits_cleanThis is now the clean rabbits dataset for the ACT.
Let’s look at the excluded datasets.
rabbits_excluded <- rabbits %>%
# Find all the excluded observations
anti_join(
rabbits_sf_clean,
join_by(
eventDate,
scientificName,
taxonConceptID,
recordID,
dataResourceName,
occurrenceStatus
)
)
rabbits_excludedLet’s plot these points just for sanity checks.
# Ref [1]: Create spatial visualisation of the ACT
state2021 %>%
filter(state_name_2021 == "Australian Capital Territory") %>%
ggplot() +
# Ref [1]: Create background polygon of the ACT
geom_sf(
aes(geometry = geometry)
) +
# Plot the Jervis Bay polygon
geom_sf(
data = {
suburb2021 %>%
filter(suburb_name_2021 == "Jervis Bay")
},
aes(geometry = geometry)
) +
geom_point(
data = rabbits_excluded,
aes(
x = decimalLongitude,
y = decimalLatitude
),
alpha = 0.6
) +
coord_sf() +
theme_minimal() +
labs(title = "Excluded Data Points")It seems like we have picked up a few extra points that seem to be just outside the ACT border boundaries. Because of the scope of the assignment, we will keep excluding all of these points. It is likely that this is either a misalignment with the ACT boundary from the ABS and the mapping data input software used by the data providers. It could also be that the borders have slightly changed in the past as the ABS boundary is the one used in 2021. There could also be an error in the GPS location data that is causing these points to be outside the ACT boundary. Regardless of the reason, due to the scope of this assignment, these points will be excluded.
# Let's reuse the same columns/column names as in rabbits
rabbits_excluded <- rabbits_excluded %>%
# Add excluded_notes
mutate(
date = "2023-10-05",
excluded_notes = "These data points are outside the ABS ACT boundary. They include points that are just outside the boundary within a few kilometers and points that are in Jervis Bay"
)
rabbits_excludedLet’s make a final spatial visualisation.
# Ref [1]: Create spatial visualisation of the ACT
state2021 %>%
filter(state_name_2021 == "Australian Capital Territory") %>%
ggplot() +
# Ref [1]: Create background polygon of the ACT
geom_sf(
aes(geometry = geometry)
) +
geom_point(
data = rabbits_clean,
aes(
x = decimalLongitude,
y = decimalLatitude
),
alpha = 0.6
) +
coord_sf() +
theme_minimal() +
# Ref [2]
labs(title = "Rabbit observations in the ACT")Ideally, we want to create functions that do all of this cleaning for us.
Let’s start with a function that extracts the raw data for a particular species for all of Australia. We will sort the state/territory later.
load_galah_occurrence_data <- function(
# character string of scientific name of species
species_name
) {
species_occurrence <- galah_call() %>%
# To conduct a search for species scientific name
galah_identify(species_name) %>% # Ref [2]
# Pre-applied filter to ensure quality-assured data
# the "ALA" profile is designed to exclude lower quality records Ref [3]
galah_apply_profile(ALA) %>%
atlas_occurrences()
return(species_occurrence)
}Let’s test the load_galah_occurrence_data function.
## This query will return 113,383 records
##
## Checking queue
## Current queue size: 1 inqueue running .....
| Name | rabbits_all_states |
| Number of rows | 113383 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 2 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| scientificName | 0 | 1 | 21 | 31 | 0 | 2 | 0 |
| taxonConceptID | 0 | 1 | 73 | 73 | 0 | 2 | 0 |
| recordID | 0 | 1 | 36 | 36 | 0 | 113383 | 0 |
| dataResourceName | 0 | 1 | 8 | 60 | 0 | 36 | 0 |
| occurrenceStatus | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| decimalLatitude | 151 | 1 | -33.84 | 4.28 | -54.58 | -36.85 | -34.25 | -32.77 | 64.13 | ▇▂▁▁▁ |
| decimalLongitude | 151 | 1 | 143.30 | 7.63 | -122.80 | 140.43 | 144.28 | 149.09 | 175.70 | ▁▁▁▁▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| eventDate | 3097 | 0.97 | 1824-01-01 | 2023-10-01 01:12:00 | 2016-04-25 | 17498 |
It seems like there are some missing values for
decimalLatitude and decimalLongitude, as well
as for eventDate. Let’s explore these values.
rabbits_all_states %>%
filter(is.na(decimalLatitude)) %>%
group_by(dataResourceName) %>%
count() %>%
arrange(desc(n))rabbits_all_states %>%
filter(is.na(eventDate)) %>%
group_by(dataResourceName) %>%
count() %>%
arrange(desc(n))It seems that most of these missing values are from museum providers
for OZCAM. It also seems that the Tasmanian Values Atlas have not
provided any spatial information. Let’s put all of these data points
into the excluded_data dataset.
Let’s create a function that will do this automatically.
remove_na_values <- function(species_occurrence) {
species_occurrence_clean <- species_occurrence %>%
filter(
!is.na(eventDate),
!is.na(decimalLatitude),
!is.na(decimalLongitude)
)
return(species_occurrence_clean)
}
add_na_eventdate_to_excluded_data <- function(species_occurrence) {
# Let's reuse the same columns/column names as in rabbits
add_excluded <- species_occurrence %>%
filter(is.na(eventDate)) %>%
# Add excluded_notes
mutate(
date = Sys.Date(),
excluded_notes = "These data points have a missing eventDate value."
)
excluded_data <- bind_rows(excluded_data, add_excluded)
return(excluded_data)
}
add_na_spatial_to_excluded_data <- function(species_occurrence) {
# Let's reuse the same columns/column names as in rabbits
add_excluded <- species_occurrence %>%
filter(
is.na(decimalLatitude),
is.na(decimalLongitude)
) %>%
# Add excluded_notes
mutate(
date = Sys.Date(),
excluded_notes = "These data points have missing spatial values."
)
excluded_data <- bind_rows(excluded_data, add_excluded)
return(excluded_data)
}Now, let’s use the functions and remove and log the NA
values.
rabbits_all_states_clean <- rabbits_all_states %>%
remove_na_values()
# Initialise empty excluded_data tibble.
excluded_data <- tibble()
excluded_data <- rabbits_all_states %>%
add_na_eventdate_to_excluded_data()
excluded_data <- rabbits_all_states %>%
add_na_spatial_to_excluded_data()
rabbits_all_states_cleanNow, let’s convert rabbits_all_states_clean into an sf
so we can find out which point is within each state/territory. Then we
can use the st_filter function to filter out each point
that is within each respective state and create a new column called
state that will tell us which point is in which state.
# Ref [1,2]
add_state_column <- function(species_all_states_clean_input) {
# Make sure both your tibble and shapefile have the same coordinate reference
# system (CRS)
species_all_states_clean_sf <- st_as_sf(
species_all_states_clean_input,
coords = c("decimalLongitude", "decimalLatitude"),
crs = st_crs(state2021)
)
# Create a vector of state/territory names
state_names <- c(
"New South Wales",
"Victoria",
"Queensland",
"South Australia",
"Western Australia",
"Tasmania",
"Northern Territory",
"Australian Capital Territory"
)
# Initialise empty list
species_all_states_clean_list <- list()
# Ref [3]: Create a separate tibble for all points within each state and
# put in list
for (state_name in state_names) {
species_state_clean <- species_all_states_clean_sf %>%
st_filter({state2021 %>% filter(state_name_2021 == state_name)}) %>%
mutate(state = state_name) %>%
as_tibble()
species_all_states_clean_list[[state_name]] <- species_state_clean
}
# Bind all 8 tibbles back into one big tibble
species_all_states_clean_output <- species_all_states_clean_input %>%
# Join the new columns (geometry, state) to the input tibble
left_join(
{bind_rows(species_all_states_clean_list)},
join_by(
eventDate,
scientificName,
taxonConceptID,
recordID,
dataResourceName,
occurrenceStatus
)
)
return(species_all_states_clean_output)
}
rabbits_all_states_clean <- rabbits_all_states %>%
remove_na_values() %>%
add_state_column()
rabbits_all_states_cleanGreat! Now we have a cleaned tibble, where we have removed values
that have missing spatial data and missing eventDate
values, as well as added a state column indicating, which point is in
which state. Let’s quickly do some sanity checks on the
state column.
It seems like there are several points that are not in any of the state/territory polygons. Let’s investigate these as we will likely just exclude these points.
rabbits_all_states_clean <- rabbits_all_states_clean %>%
mutate(
state = case_when(
is.na(state) ~ "Not within any state/territory",
.default = state
)
)
state2021 %>%
ggplot() +
# Ref [1]: Create background polygon of the ACT
geom_sf(
aes(geometry = geometry)
) +
geom_point(
data = rabbits_all_states_clean,
aes(
x = decimalLongitude,
y = decimalLatitude,
colour = state
),
alpha = 0.6
) +
scale_colour_viridis_d() +
coord_sf() +
theme_minimal() +
# Ref [2]
labs(title = "Rabbit observations in Australia")Ah hah! Interestingly, the dataset that I am using must not be filtered for just observations in Australia. Good thing we did this sanity check! Let’s just show the not within any state/territory points, just to make sure.
rabbits_all_states_clean <- rabbits_all_states_clean %>%
mutate(
state = case_when(
is.na(state) ~ "Not within any state/territory",
.default = state
)
)
state2021 %>%
ggplot() +
# Ref [1]: Create background polygon of the ACT
geom_sf(
aes(geometry = geometry)
) +
geom_point(
data = {
rabbits_all_states_clean %>%
filter(state == "Not within any state/territory")
},
aes(
x = decimalLongitude,
y = decimalLatitude,
colour = state
),
alpha = 0.6
) +
scale_colour_viridis_d() +
coord_sf() +
theme_minimal() +
# Ref [2]
labs(title = "Rabbit observations in Australia",
subtitle = "Only observations that didn't fall within any state/territory")Interestingly, it seems that there are quite a few observations that must be along the coastline / in the water. Let’s just look at NSW.
rabbits_all_states_clean <- rabbits_all_states_clean %>%
mutate(
state = case_when(
is.na(state) ~ "Not within any state/territory",
.default = state
)
)
state2021 %>%
ggplot() +
# Ref [1]: Create background polygon of the ACT
geom_sf(
aes(geometry = geometry)
) +
geom_point(
data = {
rabbits_all_states_clean %>%
filter(state == "Not within any state/territory")
},
aes(
x = decimalLongitude,
y = decimalLatitude,
colour = state
),
alpha = 0.6
) +
scale_colour_viridis_d() +
coord_sf() +
theme_minimal() +
# Ref [2]
labs(title = "Rabbit observations in Australia")